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Abstract 

The evolution of the influenza A virus to increase its host range is a major concern worldwide. Molecular mechanisms of 
increasing host range are largely unknown. Influenza surface proteins play determining roles in reorganization of host-sialic 
acid receptors and host range. In an attempt to uncover the physic-chemical attributes which govern HA subtyping, we 
performed a large scale functional analysis of over 7000 sequences of 16 different HA subtypes. Large number (896) of 
physic-chemical protein characteristics were calculated for each HA sequence. Then, 10 different attribute weighting 
algorithms were used to find the key characteristics distinguishing HA subtypes. Furthermore, to discover machine leaning 
models which can predict HA subtypes, various Decision Tree, Support Vector Machine, NaTve Bayes, and Neural Network 
models were trained on calculated protein characteristics dataset as well as 10 trimmed datasets generated by attribute 
weighting algorithms. The prediction accuracies of the machine learning methods were evaluated by 10-fold cross 
validation. The results highlighted the frequency of Gin (selected by 80% of attribute weighting algorithms), percentage/ 
frequency of Tyr, percentage of Cys, and frequencies of Try and Glu (selected by 70% of attribute weighting algorithms) as 
the key features that are associated with HA subtyping. Random Forest tree induction algorithm and RBF kernel function of 
SVM (scaled by grid search) showed high accuracy of 98% in clustering and predicting HA subtypes based on protein 
attributes. Decision tree models were successful in monitoring the short mutation/reassortment paths by which influenza 
virus can gain the key protein structure of another HA subtype and increase its host range in a short period of time with less 
energy consumption. Extracting and mining a large number of amino acid attributes of HA subtypes of influenza A virus 
through supervised algorithms represent a new avenue for understanding and predicting possible future structure of 
influenza pandemics. 
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Introduction 

Increasing host range of Influenza A virus is a major concern 
worldwide [1,2]. Avian H5N1 influenza has been infecting 
humans zoonotically since 1997, resulting in a high mortality rate 
and there were fears that it might cause the first pandemic of the 
21st century. Recent 2013 human infection with a novel avian- 
origin Influenza A (H7N9) virus with pandemic potential has 
caused significant concern with more than 130 human cases of 
severe infection in China and 43 fatalities [3—6] . It should be noted 
that due to non-existence of previous host immunity, emergence of 
new broad host range of influenza strain with the ability of human- 
to-human transmission could result in a pandemic with millions of 
fatalities [7,8]. 

Influenza A virus evolve in a complex manner with high 
frequency of genomic alteration, mainly because of: (1) its intrinsic 



segmented RNA genomic structure facilitating high frequency of 
genetic reassortment and antigenic drift [9], (2) circulating 
different subtypes of HA and NA as available genetic materials 
for genetic alteration and pandemic induction [10], and (3) 
availability of hosting environments such as swine as a shelter 
against humans or birds vaccines and as a mixing vessel for 
generating reassorted viruses [10,1 1]. 

Surface glycoproteins, haemagglutinin (HA) and neuraminidase 
(NA), determine subtypes of influenza A virus of which 16 HA 
subtypes and 9 NA subtypes have been identifled [12-14]. The 
HA and NA genes are extremely variable in sequence, and less 
than 30% of the amino acids are conserved among aU subtypes. 
Host range/specificity of influenza differs due to its surface 
proteins. HA is the key responsible of viral infectivity and 
specificity [15,16]. When virus releases from host cells, NA 
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catalyzes the cleavage of a-ketosidic linkage between a terminal 
sialic acid and an adjacent D-galactose [9]. 

Alteration of viral surface proteins to recognize a range of host 
receptors is the strategy by which influenza increases its host range 
[2,14]. Interestingly, despite its complexity, recent studies shows 
that a few amino acid substitutions (4-5) has the potential of 
altering A/Indonesia/5/2005 avian A/H5N1 and A/Vietnam/ 
1203/2004 A/H5N1 to be transmissible between ferrets via 
respirator)' droplets [2,17-19]. Converting nonlethal H5N1 
influenza virus isolated from a human to a lethal virus in mice 
happened by a single amino acid substitution from glutamic acid 
to lysine at the position 627 of the PB2 protein [20]. While the 
origin isolate solely replicates in respiratory organs, the lethal 
isolate has the ability to replicate in a variety of organs and 
produce systemic infection [20] . It seems that amino acid profiling 
of surface influenza proteins has the potential to monitor the host 
specificity. 

AUignment based methods including similarity search (BLAST), 
sequence alignment, and clustering have the abUity to cluster HA 
sequences with high accuracy. However, these unsupervised 
models only work with one feature of sequence similarity and do 
not provide knowledge in deeper levels of functional/ structural 
protein architecture. In contrast, application of supervised models 
provides the possibiUty of large scale rule discovery (associated to 
label variable) where this opportunity is restricted in common 
unsupervised (clustering) methods. The extracted rules and 
knowledge can lead to unravel underlying structure of HA 
differentiation. Rccc'ntly, a great attention has been paid to 
supervised machine learning methods implementing div(;rse amino 
acid composition and physic-chemical properties to unravel the 
underlying layers of protein function [21-31]. Mining of structural 
amino acid features have the potential to reflect these differences 
and lead us to specific changes which make a considerable impact 
on protein structure. 

The determination of protein characteristics of HA subtypes in 
a comprehensive survey can provide a new vista for understanding 
the evolution of influenza based on the modulation of protein 
characteristics. Recent achievements in developing influenza 
forecasting mathematical models based on mining of massive 
number of emergency department visits [32] or monitoring the 
health-seeking behaviour of millions of users around the world in 
the form of queries to online search engines (particularly Google) 
[7] reinforce the necessity of large scale investigations in pattern 
recognition and modelling of influenza. 

Machine learning methods have three main steps. The first step 
is extracting the n-dimensional features vector (which is composed 
of descriptors derived from the protein sequences in order to 
reflect difierent aspects of structural and physic-chemical proper- 
ties of each protein) with a class label attached. Various sets of 
protein features including amino acid compositions, dipeptide 
compositions, pseudo amino acid compositions, normalized 
Moreau-Broto autocorrelation, Moran autocorrelation, Geary 
autocorrelation, and recentiy, distribution of various structural 
and physic-chemical properties have been used to make protein 
features vector [25,31,33-36]. Studies considering the impact of 
the different protein features in predicting protein function have 
demonstrated that the combination of protein features and 
considering features such as dipeptides gives a significantiy higher 
performance than the use of individual protein features [36] . The 
second step of machine learning approach is application of 
machine learning method (or classifier) for prediction of the class 
label of the protein features input [25,26,31,33-36]. Currently, 
many machine learning methods, such as neural networks, support 
vector machine (SVM), and decision trees have been successfully 



developed for the prediction of protein function 
[21,22,25,27,30,37,38]. Each algoritimi may be run with different 
criteria aiming to find important features and predict the function 
based on key announced features. The third step is measuring the 
performance of the prediction method and its validity using 
approaches such cross vahdation technique and independent 
evaluation (IE) datasets [39-50]. 

A decision tree is constructed by looking for regularities in data, 
determining the features to add at the next level of the tree using 
an entropy calculation, and then choosing the feature that 
minimizes the entropy impurity [51,52]. Decision tree is method 
of choice for prediction since it presents hierarchical ranking of 
important features and provides a clear image of diflerential 
protein structure [25]. 

Support Vector Machine (SVM) is a binary classification 
method proposed by Vapnik et.al (1995) which originally designed 
for classification and regression tasks [53]. The SVM method has 
been employed for pattern recognition problems in computational 
biology, including gene expression analysis [54], protein-protein 
interactions [55], protein fold class prediction [37,56], and 
protein-nucleotide interactions [21]. SVM has high performance 
level when high degree of diversity exists in datasets, because 
basically, SVM classifier depends on the support vectors, and the 
classifier function does not influenced by the entire dataset. Due to 
the exploitation of kernel functions, SVM is able to efficiently deal 
with a very large number of features [57]. Regarding the 
construction of a big dataset of protein features in this study, 
SVM is one of the preferred machine learning algorithms. 

Bays and Empirical Bayes algorithms are highly efficient in 
prediction of cases with a Large number of variables but fewer 
observations [58]. Here, the dataset is unbalanced regarding the 
huge number of available sequences of HI and H5 subtypes but 
small number of subtypes such as H15 and HI 6. In this case. Naive 
Bayes based on Bayes conditional probability rule might work well 
in prediction. 

Neural network is a mathematical structure able to process 
information through many connected neurons that respond to 
inputs through modifiable weights, thresholds, and mathematical 
transfer functions [59,60]. The network topology is one of the 
parameters that has a significant effect on the performance of a 
neural network [61]. Neural networks are widely used in a number 
of protein studies including protein secondary structure prediction 
[62], protein-nucleotide interactions [63], protein fold class 
prediction [37] and protein localization prediction [64]. 

Supervised machine learning algorithms have been merely 
applied on raw sequence. These modles have demonstrated high 
capability in rule discovery, finding hot spots on sequence, and 
prediction. Attaluri (2009) extracted 2154 sequences of HI, H2, 
H3, Nl and N2 antigens from NCBI and constructed J48 decision 
tree. They successfully identified 78 informati\'e positions for 
detecting HA subtype and 63 positions for detecting NA subtype 
to predit virus subtypes [65] . In another study, protein secjuences 
of all 8 segments of swine and human hosts of 2009 pandemics 
viruses were used to construct SVM model. The developed SVM 
model was able to detect viral host [66]. Association rules have 
been also employed in this context successfully to classify virus host 
between human, avian and swine [67]. 

Recently, we demonstrated that, instead of raw sequence 
analysis, extracting a large number of amino acid attributes and 
utilizing adequate data mining models can result in efficient and 
precise models in predicting the behaviour of malignant and 
benign breast cancer proteins [68], thermostable proteins [23], 
halostable proteins [69], ammonium transporters [70], and protein 
pumps [71]. A large scale analysis of amino acid structural 
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attributes of influenza surfaces proteins rather than raw sequence 
allignment, may provide a clearer image of underlying molecular 
mechanisms of host range increase by detecting the key structural 
protein characteristics which govern HA subtyping. 

The aim of the present study was unraveUing the molecular 
bases of HA subtype dilferences and discovery of the key protein 
characteristics which govern these differences. We used various 
clustering, screening, item set mining and decision tree models to 
determine which protein attributes may be used as a marker to 
differentiate between HA subtypes of influenza A viruses. A large 
scale analysis of computationally calculated protein characteristics 
of HA sequences provided a clear image of the role of simple 
amino acid characteristics in HA-based host differentiation. 
Finding reliable models to predict mutations/reassortments 
responsible for crossing the species barrier via amino acid features 
opens new avenues for prediction of protein structure of possible 
future pandemics. 

Results 

Data Cleaning 

The initial dataset contained 7338 records (HA protein 
sequences) with 896 protein attributes. Of tiiese records, 46.14% 
(3386 records) were classified as HI class, 1.77% (130 records) as 
H2, and 26.06% (1913 records), 2.16% (159 records), 9.85% (723 
records), 3.92% (288 records), 4.68% (344 records), 0.21% (16 
records), 2.46% (181 records), 1.10% (81 records), 0.96% (71 
records), 0.35% (26 records), 0.19% (14 records), 0.04% (3 
records), 0.04% (3 records) were classified as H3 to HI 6, 
respectively. For each record, 868 features remained following 
removal of duphcates, useless attributes, and correlated features 
(dataset is available upon request). 

When the number of variables (attributes) is sufficiently large, 
the ability to process units significantly reduces. Data cleansing 
algorithms have been used previously to remove correlated, 
superfluous or duplicated attributes and consequently to generate 
much smaller databases [27,28,72]. Following application of 
similar algorithms, 5% of the attributes were discarded from the 
original dataset. 

Attribute Weighting 

Data were normalised before running the models; consequently, 
all weights would be between 0 and 1 . The result of application of 
10 different attribute weighting algorithms is presented in Table 
S4. In this table, the weight closer to 1 shows high correspondence 
between the certain protein feature and target variable (HA 
subtypes). In other words, each weight shows the importance of 
each attribute regarding the target label based on its attribute 
weighting algorithm. As mentioned before, an attribute was 
assumed important if that attribute received weight higher than 
0.5 (>0.5) by a certain attribute weighting algorithm (Table S4). 

Weighting by PCA. Only one attribute, non-reduced Cys 
extinction coeflicient at 280 nm significantly weighed (equal to 
1.0, Table S4). 

Weighting by SVM. The percentage of Met and Cys and the 

frequencies of Asp, Ala and Gin were the five protein attributes 
that gained weights higher than 0.50 (1.0, 0.89, 0.64, 0.54 and 
0.54, respectively) (Table S4). 

Weighting by relief. Thirteen attributes showed weights 
higher than 0.50 when this model was applied to the dataset 
(FCdb). These attributes were the frequencies of Phe - Ala and 
Tyr, the percentage of Tyr, the frequencies of Arg and Gin, the 
count of lie, the percentage of Cys, the frequency of Glu, 



Isoelectric point, the frequency of Asp and Lys and the 
percentages of His and Try (Table S4). 

Weighting by uncertainty. Sixteen protein attributes 

weighed higher than 0.50. The percentage and the frequency of 
Tyr gained the highest values (1.0 and 0.96, respectively) and the 
count of Thr gained the lowest value (0.51). 

Weighting by gini index. Again the percentage and the 
frequency of Tyr weighed the highest (1 .0) and the count of He, the 
frequencies of Arg and Gin, the count of Ser, the frequency of Glu, 
the percentage of Cys, the count of Val and nitrogen, the 
frequency of Asp, the percentage of His, weight, isoelectric point 
and the percentage of Try received the highest weighs as 0.95, 
0.91, 0.87, 0.85, 0.81, 0.78, 0.77, 0.76, 0.72, 0.63, 0.52 and 0.51, 
respectively. 

Weighting by chi squared. The following 1 3 attributes were 

weighted higher than 0.50: the percentage and the frequency of 
Tyr, the frequencies of Gin and Phe, non - reduced Cys extinction 
coeflicient at 280 nm, the frequency of Gly, the percentages of 
Met and Cys, the frequency of Glu, the percentage of Try and Pro, 
the frequency of Ala and the percentage of His. 

Weighting by deviation. We found the non - reduced Cys 
extinction coefficient at 280 nm to be the sole important protein 
attribute gained the weight equal to 1.0. 

Weighting by rule. The frerjuencies of Tyr and Gin were 
among 14 other protein attribut(-s with weighted equal to or 
greater than 0.50 when rule algorithm ran on the dataset. The 
other attributes were: the percentages of Tyr, Trp, Met, the 
frequencies of Phe and Arg, aliphatic index, the frequency of Glu, 
the count of He, the frequency of Asp, the count of Ser, isoelectric 
point and the iix'cjuc'ncy of negatively charged residues. 

Weighting by Information gain. Seventeen protein attri- 
butes weighted equal to or greater than 0.50. They included the 
percentages of Cys and Tyr, the frequencies of Tyr and Glu, the 
count of He, the frequencies of Glu and Arg, the percentage of His, 
the counts of Ser, Nitrogen and Val, the frequency of Asp, weight, 
the percentages of Try and Meth, the count of hydrophilic 
residues, non - reduced Cys extinction coeflicient at 280 nm and 
the frequency of Phe. 

Weighting by Information gain ratio. When this algorithm 
was applied to the dataset, three attributes (the percentage of Try, 
the frequency of Ala and non - reduced Cys extinction coeflicient 
at 280 nm) weighed the highest possible weights (1.0). Thirty other 
attributes had weights equal to or higher than 0.50. 

Overall, the number of protein attributes that gained weights 
higher than 0.5 in each weighting model were as follows: PCA (1 
attribute), SVM {5 attributes). Relief {13 attributes), Uncertainty (16 
attributes), Gini index (15 attributes), Chi squared (13 attributes). 
Deviation (1 attribute), Rule (14 attributes). Gain ratio (33 attributes) 
and Info Gain Ratio (18 attributes). The most important attributes 
that were confirmed by different weighting algorithms to be 
involved in differentiation of HA protein are shown in Table 1 . 

Key Structural Protein Attributes Distinguishing Different 
Influenza HA Subtypes based on Overall Conclusion of 
the Attribute Weighting Algorithms 

The protein attributes which announced important by most of 
the attribute weighting algorithms (intersection of different 
weighting methods) were assumed as the key distinguishing 
protein features in HA subtvping and are presented in Table 1. 
Based on the mentioned intersection of attribute weighting models 
(Table 1), 14 protein attributes were announced as the key 
distinguishing features in structure of HA subtypes, including The 
frequency of Gin, The frequency of Tyr, Percentage of Tyr, 
Percentage of Cys, The frequency of Glu, Percentage of Try, 
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Table 1. The most important protein attributes (features) in structure of different HA subtypes selected by different attribute 
weighting algorithms. 



Attribute 


The number of attribute weighting algoritms that indicated the attribute as important 


The frequency of Gin 8 


Percentage of Cys 


7 


Percentage of Tyr 


7 


The frequency of Tyr 


7 


The frequency of Glu 


7 


Percentage of Try 


7 


Count of lie 


6 


The frequency of Arg 


6 


Percentage of His 


6 


The frequency of Asp 


6 


Percentage of Met 


6 


Non-reduced Cys extinction 
coefficient at 280 nm 


6 


The frequency of Phe 


5 



Total number of attribute weighting algorithms which have announced the certain attribute important (weight higher than 0.5, Table S4). This table presents the 
number of algorithms that selected the attribute. Weighting algorithms were PCA, SVM, Relief, Uncertainty, Gini index, Chi Squared, Deviation, Rule, Information Gain, and 
Information Gain Ratio. 
doi:1 0.1 371 /journal.pone.0096984.ta01 



Count of He, The frequency of Arg, Percentage of His, The 
frequency of Asp, Percentage of Met, Non-reduced Cys extinction 
coefficient at 280 nm and The frequency of Phe. 

Decision Trees 

Of the 1 76 generated trees, most of them generated good and 
meaningful trees and just one of them {Random Tree with accuracy 
criteria) did not result in tree with roots and leaves. 

The count of Tyr was the sole attribute used to build a single - 
branched tree when the Decision Stump model on Gini Index criterion 
was applied to the dataset (Figure 1). When the value for this 
feature was equal to 26, 27, 28, 29 or 30, the virus fell into the HI 
class; if the value was equal to 18, 19, 20, 21 or 22, the virus 
belonged to the H3 class. When the count of Tyr was equal to 17, 
the subtype of the virus was H4; but when the value was equal to 
23 or 24, the virus was associated with H5. H6 virus subtype was 
identified when the count of Tyr was equal to 26. Finally, when 
the value was equal to 13, 14 or 15, the virus fell into the H7 class. 

Other simple trees were produced by the Decision Stump and 
Decision Tree models on the accuracy criterion with just one branch. 
The frequency of positively charged residues was the sole protein 
attribute used to build this tree. Similarly the Decision Tree Parallel 
(run on Gain Ratio criterion) generated a tree with one branch. This 
branch is based on the frequency of Pro - Ala with a frequency 
equal to or less than 0.5 indicating a HI subtype. The Decision 
Stump (run on gain ratio) model induced a one-level tree showing 
the percentage of Try to be the most important feature. 
Specifically, when the value for this feature was higher than 
1.407, the virus fell into the HI subclass, but when the value was 
equal to or higher than 1.407, the virus belongs to the H5 class. 
Higher power of trees in distinguishing HI and H5 is because of 
higher numbers of these subtypes and consequendy, ans conse- 
quently better training and higher capability of theses models in 
recognition of HI and H5. 

In other decision trees such as those produced by the Random 
Tree (on Gini index), dipeptide features (such as the frequency of Pro 



- Gly) were the main protein feature to build the trees and the 
following protein attributes were used to build the tree branches: 
the count of Phe - Met, the count of Asn - Met and the frequency 
of Trp - Leu. All virus subclasses (except H6, H8, HIO, Hll and 
HI 4) were classified by this model (Figure 2). This model was one 
of the most successful models in distinguishing H subtypes. 

Decision Tree Parallel model (run on Infotmation gain, Gini index or 
Accuracy criteria). Decision Tree model (run on Information gain, Gini 
index or Gain ratio) and Random Tree model (run on Information gain) 
induced very complex trees to distingxiish between all virus 
subclasses using protein attributes. 

Three models oi Decision Stump {Gain ratio, Gini Index and Accuracy) 
induced trees with just one branch, and the percentage of Try, the 
percentage of Tyr and the frequency of Glu were the most 
important protein attributes, respectively. Decision Tree model (run 
on Accuracy) built a two branches tree; if the frequency of Glu was 
equal to or less than 0.054, the virus fell into H3, but when the 
value for this feature was higher than 0.054 and the weight of 
protein was higher than 63.971, the virus belonged to H5. If the 
weight was equal to or less than 63.971 and the frequency of Tyr 
was higher than 0.028, the virus was from the HI class; otherwise 
from the H7 class. The other models induced more complex tress. 

The percentage of Tyr, the percentage of Try and the frequency 
of Glu were the most important features selected by Decision Stump 
models to induce a simple tree with one branch. A two branches 
tree induced by Decision Tree model (run on Accuracy), and the 
frequency of Glu and the weight and non - reduced Cys extinction 
coefficient at 280 nm were the key protein attributes to classify 
HI, H3, H5 and H7. The other models induced more complex 
tress. Again, Stump Decision models induced simple trees and the 
count of other residues, the percentage of Tyr and the percentage 
of Try were the most important features to build the trees on the 
Info Gain Ratio database. The accuracies of different decision tree 
models are compared in Table 2. 



PLOS ONE I www.plosone.org 



4 



May 2014 | Volume 9 | Issue 5 | e96984 



HA-Subtyping in the Level of Physic-Chemical Characteristics 



[court ofT 



= 13.n= 14.0= 15.0= 16.0= 17.0= 18.0= 1S.0 = 20.0 = 21.0 = 22.0 = 23.0 = 24.0 = 25.0 = 26.0 = 27.0 = 28.0 = 29.0 = 39.0 















i 




























HT 




HI 




m |JLj H11 ^-Cj H4 


H3 




H3 




H3 






H5 




HS 




HC 


HI 


> 


HI 




HI 




HI 




HI 



1 



1 



I 



Figure 1. Decision Tree from Decision Stump model ran with G//7/7n<yex^ criterion. As may be inferred from the figure, the count of Tyr was the 

most important and the sole protein attributes in distinguishing various HA subtypes of influenza virus A. When the value for this feature was equal 
to 26, 27, 28, 29 or 30, the virus fell into the HI class; if the value was equal to 18, 19, 20, 21 or 22, the virus belonged to the H3 class. While the count 
of Tyr was equal to 17, the subtype of the virus was H4; but when the value was equal to 23 or 24, the virus was associated with H5. H6 virus subtype 
was identified when the count of Tyr was equal to 26. Finally, when the value was equal to 13, 14 or 15, the virus fell into the H7 class. Underneath, 
the host species for each virus class has been depicted. 
doi:l 0.1 371 /journal.pone.0096984.g001 



Machine Learning Models to Predict Unknown Influenza 
A Virus Classes Based on Protein Attributes 

Decision trees. The accuracies of trees induced by various 
decision tree models are presented in Table 2. Generally, most 
models showed accuracies higher than 80% while the lowest 
accuracies gained by Decision Tree models ran on FCdb (the original) 



dataset with Info Gain criteria (average of 46.04%). The best 
predicted accuracy achieved when Decision Tree run with Gini Index 
criteria on Uncertainty dataset (99.70%) (Figure 2). 

SVM approach. The total accuracy predicted by C-SVC 
method (when Gamma and C were 0.0065 and 10, respectively) 
reached 94.52% ±0.86% with the RBF kernel innciion (the lowest 
and highest prediction rates — 65.19% and 100% — obtained for 
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Figure 2. Decision Tree from Random Tree model ran with Gini Index criterion. As may be inferred from the figure, the frequency of Pro - Gly 
was the most important protein attributes to build the tree and the counts or the frequencies of other dipeptides used to generate the tree branches 
and to distinguish various HA subtypes of influenza virus A. With the defined valuse for the count of Phe - Met, the count of Asn - IVIet and the 
frequency of Trp - Leu, the virus subtypes were either H3 or H5. With different values for the count of Asn - Met, various virus subtypes distinguished. 
All virus subclasses (except H6, H8, HIO, Hll and H14) were classified by this model. Underneath each subtype common host has been depicted. 
doi:l 0.1 371/journal.pone.0096984.g002 
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H5 and HI, respectively). The average number of support vectors 
was 161.88 with the lowest number observed for class H15 and 
H 1 6 (3) while the highest number of support vectors was gained 
for H3 (527). 

When Gamma and C changed to 0 and 10, the total prediction 
accuracy reached to 99.70% ±0.15%. The average number of 
support vectors for this run decreased to 54.50 with the lowest 
number (3) again for the same classes (HI 5 and HI 6) and the 
highest number for HI class (181). The accuracy did not improve 
by using the other kernel functions (linear, poly, sigmoid, and pre- 
computed). 

Naive bayes. As seen in Table 3, the lowest accuracy 
(63.36%) gained when Maive Bayes model ran on SVM dataset. In 
contrast, the best accuracy (99.22%) gained when Bayes Kernel 
model ran on FCdh. The accuracy of Naive Bayes model ran on 
FCdb was 98.95% ±0.30% while the same accuracy for Naive Bayes 
Kernel was higher (99.27% ±0.20%). Distribution model for label 
attribute (virus subtype) ranged from 0.0001 for H15 and H16 
classes up to 0.460 for HI classes for both Naive Bayes and Naive 
Bayes Kernel. 

Neural network. Tables 3 presents the comparative perfor- 
mances of different combinations of neural network algorithms 
with 11 datasets [original dataset (FCdb) as well as datasets pre- 
filtered by PCA, Uncertainty, Rehef, Chi Squared, Gini Index, 
Deviation, Rule, Gain Ratio, Info Gain, and SVM weighting 
algorithms based on 10-fold cross validation. The average 
accuracies oi Auto MLP and Neural Net models were generally high 
(more than 95%, presented in Table 3). 

As it can be inferred from Table 3, pre-trimming of protein 
features dataset with proper attribute weighting algorithm plays a 
determining role in obtaining high prediction accuracy. When the 
models ran on PCA databases, the accuracies were minimum 
(78.81% and 81.44%, respectively), while the best accuracy 
(99.73%) gained when either Neural Net or Auto MLP models ran 
on Info Gain Ratio dataset or Chi Squared datasets. The 
performances of neural network models on other datasets (such 
as Gini Index, Info Gain, Relief, Rule, SVM, and Uncertainty) were also 
higher than 99% showing their suitability on predicting the right 
HA virus classes. 



Statistical Analysis of the Key Found Protein Attributes in 
the Structure of HA Subtypes 

Statistical distribution of important key amino acid attributes in 
the structure of HA subtypes is presented in Table S2. 
Interestingly, The freq of Glutamine and The freq of Tyrosine 
which were selected by most of attribute weighting algorithms 
(Table 1) and decision tree models (Figure 1) had the highest 
variation between HA subtypes. This confirms the high efficiency 
of attribute weighting algorithms in feature selection and finding 
the elfective features. 

In agreement with the outcomes of attribute weighting 
algorithms, ANOVA showed that that in respect to dilferent HA 
subtypes (as dependent variable), chflFerences between important 
structural features including Non-reduced cysteines Extinction, 
Count of Isoleucine, Freq of Cysteine, Freq of Aspartic Acid, Freq 
of Glutamic Acid, Freq of Glutamine, Freq of Arginine, Freq of 
Tyrosine, Percentage of Histidine, Percentage of Methionine, 
Percentage of Tryptophan, Percentage of Tyrosine, Count of Phe- 
Met, Freq of Pro-Gly, and Freq of Trp-Leu are highly significant 
(p = 0.01) (Table S3). Interestingly, the Freq of Glutamine and 
Freq of Tyrosine received the highest R-square (95.75 and 96.47, 
respectively) implying that these two attributes, highlighted by 
decision tree and weighting algorithms (Figure 1 and Table 1), are 
the key predictors of HA subtyping. It should be mentioned that 
higher value of R-square denotes that higher amount of variance 
of dependent variable (HA subtypes) is related to independent 
variable (Freq of Glutamine and Freq of Tyrosine). 

MANOVA (multivariariate analysis) also confirmed the overall 
significant dilferences (p = 0.05) of the selected features between 
different HA subtypes (Table S3). Clustering, based on the 16 
important features is presented in Figure S 1 . As it can be inferred 
in Figure S 1 , the key discovered features in this study have highly 
accuracy in prediction and separation of HA subtypes and 
reinforces the high capability of discovered features as predictors 
of HA subtyping and underlying layer of subtype dilferentiation. 



Table 3. The accuracy of Baysian and Neural Network models on various datasets [1 1 datasets including original protein features 
dataset (FCdb) as well as 1 0 datasets generated by trimming (filtering) the original FCdb dataset by attribute weighting algorithms] 
computed by 10-fold cross validation. 







Baysian Models 




Neural Nets Models 




Bayse Kernel 


Naive Bayse 


Auto MLp 


Neural Net 


Chi Squared 


98.72% 


98.59% 


99.71% 


99.73% 


Deviation 


88.55% 


96.70% 


79.94% 


82.71% 


Gini Index 


98.90% 


98.26% 


99.70% 


99.69% 


Info Gain 


98.12% 


98.79% 


99.70% 


99.70% 


Info Gain Ratio 


84.38% 


63.36% 


99.70% 


99.73% 


PCA 


99.22% 


98.69% 


78.81% 


81.44% 


Relief 


98.95% 


98.65% 


99.63% 


99.59% 


Rule 


98.53% 


98.97% 


99.70% 


99.67% 


sm 


84.38% 


63.36% 


98.44% 


98.37% 


Uncertainty 


92.69% 


97.89% 


99.71% 


99.71% 


FCdb 


%99.18 


%97.51 


%99.73 


99.69% 



This table presents the accuracy percentage of Baysian {Naive Bayes and Bayse Kernel} and Neural Network models {Auto MLp and Neural Net) run on all 10 datasets. 
doi:l 0.1 371/journal.pone.0096984.t003 
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Discussion 

In the last century, three influenza A pandemics happened: 
HlNl in 1918, H2N2 in 1957 and H3N2 in 1968 [73]. Since 
1997, avian H5N1 and recendy avian H7N9 influenza have been 
infecting humans zoonotically with high mortality rates. The 2009 
pandemic was the result of HlNl multiple reassortant with genes 
derived from viruses that originally circulated in the swine, avian, 
and human populations [74]. 2013 China pandemic was the result 
of novel reassortant in avian influenza (H7N9) with six internal 
genes from avian influenza A (H9N2) viruses. Substitution Q226L 
(H3 numbering) at the 210-loop in the HA gene was found in two 
pandemic strains [3]. To be prepared for future pandemics, a 
detailed understanding of the basic biology of this virus its 
evolution and subtype difierentiation is critical. 

An important point of this study was discoverey of some new 
rules underpining HA differentiation, not in the le\'el of sequence 
but in the level of structural j)r()tcin archit(;chture via extraction 
and data mining of structural protein features. Influenza virus is 
also subjected to host immune pressure (enriched by vaccines in 
many cases) and undergoes rapid evolution in the antigenic 
regions, especially when the virus crosses the host species barrier 
[75]. Influenza virus does not have enough time to develop/add 
long domains or major changes in proteins; instead, it is 
reasonable that influenza viruses alter short amino patterns. 
Consequendy, rapid structural amino acid alteration is an strategy 
for influenza to survive. Based on the key amino acid attributes, we 
found the possible routes (based on decision tree algorithms) that 
influenza can acquire additional host specificity of another subtype 
via a small change in amino acid attributes (Figure 2). Through 
theses routes, influenza has more chance in increasing host range 
and survival by acquiring host specificity of another subtype with 
less energy consumption and minimum change in protein 
structure. 

This is the first successful high accurate attempt in modelling 
and prediction of influenza A virus subtypes based on physic- 
chemical properties of HA proteins. It should be noted that 
unbalanced number of HA subtypes can influence the efliciency of 
attribute weighting and modelling. Considering that each attribute 
weighting system uses a specific pattern to define the most 
important features, results may vary according to modelling 
technifjues as has been highlighted in previous studies 
[23,27,28,72,76,77]. Despite this possibility, our analysis revealed 
a high level of accuracy. Applying 10 statistically different attribute 
weighting algorithms and selection of the key features based on the 
overall (intersection) of these algorithms reinforce the importance 
of the selected features. Furthermore, weighting algorithms such as 
Relief can deal with unbalanced data by taking subsamples and 
reduce this bias [78]. Altogether, achieving the precise modelling 
approach in linking the amino acid attributes with HA subtyping is 
the result of the following improvements in the present study: (1) 
increasing the number of computationally calculated amino acid 
features to cover different aspects of HA protein structure, (2) 
testing different feature selection (attribute weighting) algorithms 
and selection of the most important amino acid attribute based on 
the overall conclusion of algorithms, (3) examining different 
supervised data mining (machine learning) algorithms, and (4) 
joining attribute weighting with different data mining algorithms 
which sharply increases the accuracy of the models in some cases. 

The frequency of Gin was the most important feature to 
distinguish virus subtypes, as defined by 80% of the attribute 
weighting algorithms while the percentages of Cys, Tyr and Try 
and the frequencies of Tyr and Glu were defined as important 
features by 70% of algorithms (Table 1). Gin, Cys, Tyr and Glu 



have been c:lassified as polar or hydrophUic amino acids. Thus, our 
results confirm the importance of hydrophUicity in forming HA 
proteins. In fact, hydrophobicity has been used as a vital 
parameter in designing a new anti-viral vaccine against HA 
proteins [79]. It has been shown that an increase in the 
hydrophUicity of the receptor binding region is apparentiy an 
evolutionary adaptation of the 2009 HlNl pandemic influenza A 
virus in 1918 Spanish, 1930 swine, and 2005 seasonal strains [80]. 
It has also been identified that mature peptide sequences of HA 
genes isolated from humans in 2009 have Gin at position 226 of 
the receptor binding site [81]. Int("rcstiiigly, this was different from 
previously isolated viruses where the presence of Leu at the same 
position contributes to a preference for human receptors whereas 
the presence of Gin contributes to a preference for avian receptors 
[81]. 

Some generated models, such as Decision Tree from Random Tree 
model ran with Gini Index criterion (Figure 2), provide novel 
knowledge in structural amino acid architecture and packaging of 
different HA proteins. It shows that count of specific dipeptides 
including Phe-Met and Asn-Met structurally differentiate HI from 
H3. Theses dipeptides are the possible modulated amino acid 
features in the structure of future Influenza A viruses tending to 
acquire host range of both HI and H3 subtypes. Figure 2 (left 
branch of tree) predicts that mutation or reassortment in H5 
resulting in lower number of Trp-Leu which may be considered as 
a key requirement for increasing host range and possible future 
pandemics by H5 subtype. Similar conclusion has been made 
recently documenting that tlu' A/Indonesia/5/2005 avian A/ 
H5N1 influenza virus may require as few as five amino acid 
substitutions [17,82] and the A/Vietnam/ 1203/2004 requires 
four substitutions and reassortment [14,18] to become transmis- 
sible between ferrets via respiratory droplets. Interestingly, Russell 
et al., 2012 demonstrated that these substitutions can arise in 
nature [19]. Based on the above discussion, the presented 
bioinformatics models in this study may open a new avenue in 
predicting the structure of future influenza A pandemics. 

Up to now, there has been littie discussion about the role of di- 
peptides in protein function. Our recent study has already 
demonstrated that specific di-peptides play the central role in 
protein halostability and thermostability [28,72]. The role of some 
di-peptides, such as Pro - Ala, Pro - Gly, Phe - Met, Trp - Leu 
and Trp - Pro (selected as important protein features by decision 
tree models to identify influenza A subtypes) was indentified in the 
present study. Interestingly, as presented in Figure 2, wide-speared 
HA subtypes of influenza virus (H3 and HI) have lower 
frequency/counts of particular dipeptides such as Asn-Met, Phe- 
Met, and Pro-Gly. Decreasing the number of tough structural 
dipeptides provides more flexibiUty for virus to attack surface ceUs 
of different hosts and escap host immune system. 

The ability of various decision tree induction models applied in 
this study to correctly classify influenza A subtypes based on 
protein attributes varied cxmsiderably. In some models, very few 
subtypes (two or three) were identified which illustrate the 
incompetence of these models. However, other models, such as 
Stump and Random tree run on removed correlated features dataset 
were able to completely classify the HA subtypes based on their 
protein features. The latter models are suitable tools to classify 
those viral subtypes. As shown in Table 2, the overall accuracies 
for tree induction models were generally high and improved when 
joined with appropriate feature weighting criteria. For example, 
the accuracy for Decision Tree model with Info Gain criterion run on 
FCdb was 46.04%, but improved to 99.06% when the criterion 
changed to Accuracy, indicating a very sharp increase in the model 
accuracy and performance. The best accuracy achieved when the 
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Random Forest model ran with Gini Index criterion (99.70%) which 
makes it the best model to apply in such conditions. The 
performances oi Naive Bayes models (with and without kernel) were 
also high enough (more than 99%), which also makes these models 
to cluster HA subtypes into the right classes with high accuracies. 

SVM is a supervised non-parametric statistical learning 
technique with there are no assumption problem that usually 
involve in identification of multiple classes (more than two). 
Adjustments are made to the simple SVM hinaTj classifier to 
operate as a multi-class classifier using methods such as one- 
against-aU, one-against-others, and directed acyclic graph. Since 
example data is often not Unearly separable, SVMi introduce the 
notion of a '^kernel induced feature space" which casts the data into a 
higher dimensional space where the data is separable. Overall, 
SVA4m are intuitive, theoretically well- founded, and have been 
shown to he practically applicabk;. The methods have been widely 
employed by researchers in different areas of science [83-88], 
including influenza research [89-93]. In general, the RBFkemelis a 
reasonable first choice in SVM approaches. This kernel nonlinearly 
maps samples into a higher dimensional space. Thus, unlike the 
linear kernel, it can handle the situation when the relationship 
between class labels and attributes is nonlinear [93]. In this study, 
RBF kernel function applied and showed high level of accuracy in 
classifying 16 classes of HA viruses based on their protein features. 
These results illustrate for the first time that RBF kernel is an 
appropriate model (on dataset scaled by grid search) to find the key 
underlying physic-chemical charateristics of HA subtypes and 
predict based on them. Convergence of training with SVM (which 
is a deterministic quadratic optimization procedure) is much faster 
than neural network (which is randomized procedure). 

Sequence-alignment based methods (BLAST and phylogenic 
trees), drawn by nucleic acid or amino acid sequence alignments, 
have been extensively employed as the basis for evolutionary 
studies. However, homology-based methods does not consider the 
structural and functional features of proteins during evolution 
[24]. The presented approch in this study based on the machine 
learning algorithms running on structural protein features provides 
a new evolutionary pathway separation of HA a subtype which 
takes into account the structural reasons of this diversity. As 
discussed before [24], the presented procedure can significantly 
enrich and qualify any type of further evolutionary studies by 
completion of the common sequence homology based phylogenic 
analysis methods. 

There is a major difference between the presented method in 
this study with the "composition vectors based methods" as 
another alignment-free method. The "composition vectors based 
methods" use a sting or repeats (amino acid or nucleotide), with a 
limited number between 1 to in maximum 7 and count the 
number of these repeats within different genomes or genes [94— 
96]. If we consider each repeat as a feature (attribute); in total, we 
wiU have 7 features. In our method, "large scale supervised mining 
of protein attributes", we have calculated 896 physic-chemical 
protein characteristics for each sequence which offers a compre- 
hensive view on underlying protein architecture and shed light on 
the key protein characteristics which govern HA subtyping. 

This research will serve as a basis for future studies on 
prediction of the structure of future HA sequences which will be 
capable of infecting a broad range of hosts. Applying this approach 
on other influenza protein segments in particular N and M2e (the 
other surface protein) wiU complete the puzzle of underling 
structural protein architecture of influenza subtypes and the 
possible structural changes which happens during host range 
increase. 



Based on the findings of this study, it is possible to predict 
antigenic variation of any input influenza sequence using 
haemagglutinin amino acid composition. This prediction does 
not need similarity searches or gathering information about the 
complex, expensive, and time-consuming features of the tertiary 
and quaternary protein structure or any need to laljoratory 
activity. The developed models can be further embedded in web- 
based applications or softwares to predict possible pandemic 
strains from recendy observed or computationally guessed HA 
sequences. Extraction of a large number of protein characteristics 
and pattern recognition trough supervised machine learning 
models can also be employed in future studies on understanding 
the interaction between antigen and antibody (immunity reactivity 
measured by ELISA/Western Blot [97]) and finding the key 
underlying physic-chemical characteristics in the structure anti- 
body and antigen which can result in vaccine breakdown. 

In addition, our findings add to the growing body of literature 
on the molecular biology of influenza virus, which are urgently 
needed by many industries and vaccine-producing institutes. 
Furthermore, our method will serve as a model for future 
investigation on NA and M2e amino acid compositions, as well 
as the interaction between HA, M2e, and NA antigens. This study 
suggests that amino acid profiling of influenza surface proteins has 
the potential to monitor host specificity. 

Materials and Methods 

i: Extraction of Structural Protein Attributes Based on HA 

Sequences 

Seven thousand and three hundred and thirty eight sequences of 
HA virus proteins from various species (human, bird, pig, horse, 
mouse, etc) were extracted from the Influenza Research Database 
(http://w^\'^v.fludb.org/) and categorized as HI to H16, according 
to the database classification. Some sequences of avian influenza 
were obtained based on AH/2006/0,50 and AH/2010/039 
projects supported by Australian Centre for International Agri- 
cultural Research (ACIAR). 

For each of these sequences, eight hundred and ninety six 
protein features (attributes) such as length, weight, isoelectric 
point, count and frequency of each element (carbon, nitrogen, 
sulphur, oxygen, and hydrogen), count and frequency of each 
amino acid, count and frequency of negatively charged, positively 
charged, hydrophilic and hydrophobic residues, count and 
frequency of dipeptides, number of a-helix and P-strands, and 
other secondary protein features were extracted using various 
bioinformatics tools and software from the ExPASy site (http:// 
www.expasy.org) and CLC bio software (CLC bio, Finlandsgade 
10-12, Katrinebjerg 8200 Aarhus N Denmark). All features were 
classified as continuous variables, except virus subtypes, virus host 
and N-terminal amino acids, which classified as categorical. A 
dataset of these protein features imported into RapidMiner 
software [RapidMiner 5.0.001, Rapid-I GmbH, Stochumer Str. 
475, 44227 Dortmund, Germany]. Virus subtypes (H1-H16) was 
set as the output variable (label) and the other variabl(;s were set as 
input variables. The following steps were then applied to the 
dataset. 

ii: Data Cleaning 

Initially, duplicate features were removed by comparing all 
examples on the basis of the specified selection of attributes (two 
examples were assumed equal if all values of all selected attributes 
were equal). Then, we removed the superfluous attributes from the 
dataset. Nominal attributes were regarded as superfluous when the 
most frequent values were contained in more or less than nominal 
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useless above or below percent of all examples. Numerical 
attributes which possessed standard deviations less than or equal 
to a given deviation threshold (0. 1) were assumed to be superfluous 
and removed. Finally, correlated features (with Pearson correlation 
greater than 0.9) were omitted. After cleaning, the number of 
attributes decreased from 896 to 893. We called this dataset final 
cleaned dataset (FCdb) and used for next attribute weighting 
analysis. This dataset is represented in Table SI. 

iii: Application of Attribute Weighting Algorithms to Find 
the Most Important Structural Protein Attributes 
Distinguishing HA Subtypes 

To determine how structural protein features of HA sequence 
determines its final subtype, 10 different attribute weighting 
algorithms including weighting by PCA, weighting by Uncertain- 
ty, weighting by Relief, weighting by Chi Squared, weighting by 
Gini Index, weighting by Deviation, weighting by Rule, weighting 
by Gain Ratio, weighting by Info Gain, and weighting by SVM 
were applied on final cleaned database (FCdh, Table SI). Attribute 
weighting algorithms find the most important protein attributes 
which differ in protein structure between different HA subtypes. 
The protein attributes which announced important by most of 
attribute weighting algorithms (intersection of different weighting 
methods) were assumed as the key distinguishing features of HA 
subtypes. 

Ten new generated datasets produced by trimming (filtering) of 
the original dataset [FCdb) via attribute weighting algorithms as 
well as the original FCdb dataset (1 1 datasets in total) were used as 
input for Decision Tree, Baysian, and Neural Network models. 

Weight by Information gain. This operator calculated the 
relevance of a feature by computing the Information Gain in class 
distribution. 

Weight by Information Gain ratio. This operator calculat- 
ed the relevance of a feature by computing the information Gain 
Ratio for the class distribution. 

Weight by rule. This operator calculated the relevance of a 
feature by computing the error rate of a one R Model on the 
example set without this feature. 

Weight deviation. This operator created weights from the 
standard deviations of all attributes. The values were normahsed 
by the average, the minimum, or the maximum of the attribute. 

Weight by chi squared statistic. This operator calculated 
the relevance of a feature by computing, for each attribute of the 
input example set using chi-squared statistic with respect to the 
class attribute. 

Weight by gini index. This operator calculated the relevance 
of an attribute by computing the Gini index of the class distribution, 
if the given example set would have been split according to the 
feature. 

Weight by uncertainty. This operator calculated the rele- 
vance of an attribute by measuring the symmetrical uncertainty 
with respect to the class. 

Weight by relief. This operator measured the relevance of 
features by sampling examples and comparing the value of the 
current feature for the nearest example of the same and of 
different class. This version also worked for multiple classes and 
n-grcssion data sets. The resulting weights were normalised into 
the interval between 0 and 1. 

Weight by SVM (Support Vector Machine). This operator 
used the coefficients of the normal vector of a linear SVM as 
feature weights. 

Weight by PCA (Principle Component Analysis). This 
operator used the factors of the first of the principal components as 
feature weights. 



iv: Attribute Selection and Generation of New Pre- 
trimmed Cprotein Feature Datasets 

After running attribute weighting models were run on FCdb 
dataset (original clean dataset of protein features), each protein 
attribute (feature) gained a value between 0 and 1, which reveales 
the importance of that attribute with regard to a target attribute 
(HA subtypes). As mentioned before, an attribute was assumed 
important if that attribute received weight higher than 0.5 (>0.5) 
by a certain attribute weighting algorithm (Table S4). The protein 
attributes which announced important by most of attribute 
weighting algorithms (intersection of different weighting methods) 
were assumed as the key distinguishing protein features in HA 
subtyping and are presented in Table I . In conclution. 

All variables with weights higher than 0.50 were selected and 
10 new datasets were created according to 10 applied attribute 
weighting algorithms. These newly formed datasets were named 
according to their attribute weighting models [Information gain, 
Information gain ratio, Rule, Deviation, Chi Squared, Gini index. 
Uncertainty, Relief, SVM and PCA) and were used to join with 
subsequent models (supervised). Each model of supervised or 
unsupervised clustering were performed 1 1 times; the first time it 
ran on the main dataset [FCdb) and then on the 10 newly formed 
datasets from attribute weighting and selection. 

V: Machine Learning Models to Predict Unknown 
Influenza A Virus Classes Based on Protein Attributes 

As mentioned above, the original FCdb dataset as well as 
1 0 new generated datasets produced by trimming (filtering) of the 
original dataset (FCdh) \ia attril)ute weighting algorithms were 
used as input for machine learning models. 

Four classes of machine learning models (Decision Trees, SVM, 
Baysian and Neural Network algorithms) were applied on all 1 1 
datasets to find suitable modeI(s) in order to predict unknovra 
classes of influenza A virus virus based on the computed protein 
attributes computed. 

To prevent overfitting and calculate the accuracy of each 
model, 10-foId cross validation was employed in this study to train 
and test models on all patterns. Ten-fold cross validation is a 
standard and commonly used method for evaluating classifier 
methods, as set of proteins used for training and testing are 
mutually exclusive. Details of this \vcll-known method and its 
principals have been extensively discussed in many previous 
published papers [18-29]. 

Based on 10-fold cross validation; records were divided into 10 
nearly equal parts randomly. Our dataset had 7353 records in 
total. In other words, records (7353) were randomly divided into 
10 parts: 9 parts consisted of 735 records, and the last one 
contained 738 records. When prediction algorithms such as SVM, 
Neural Network or Tree Induction models were performing, nine 
sets (parts) were used for training and the 10th one for testing. 
Then, in next run, another part sc't as tc'sting set and the other 9 
parts as training sets. The process was repeated 10 times and the 
accuracies for true, false and total accuracy were calculated. The 
final accuracy reported as the average of the accuracy in all ten 
tests (runs). 

In each run, the predictor or machine learning system did not 
expose to test set and just trains on 9 training sets (so this diminish 
the possibility of overfitting into zero). When the system is trained 
well and the calculated accuracy is reached at least to 0.85%, then 
the model tries to predict or guess the unknown test set and 
calculates the accuracy for this set. Repeating the same procedure 
for 1 0 times and non-using of the test set for training are the bases 
of preventing the learning application from overfitting. 
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Decision tree. Sixteen machine learning models run on four 
decision tree algorithms {Decision Tree, Decision Tree Paralkl, Decision 
Stump and Random Forest) with four different criteria {Gain Ratio, 
Information Gain, Gini Index and Accuracy) on all 1 1 datasets. A 
decision stump is a Decision Tree, which uses only a single 
attribute for splitting. For discrete attributes, this typically means 
that the tree consists only of a single interior node (i.e., the root has 
only leaves as successor nodes) [98] . If the attribute is numerical, 
the tree may be more complex. Models trained and tested with 
ten-fold cross validation and the average of accuracies were 
computed as stated above. 

Support Vector Machine (SVM). SVM% are popular and 
powerful techniques for supervised data classification and predic- 
tion, so four different SVM models {SVM, LibSVM, Linear SVM and 
Evolutionary SVM) were used here to implement models for 
prediction of Influenza A classes based on protein features. Briefly, 
all 11 databases [original protein feature dataset (FCdb) and 10 
datasets genc-ratcd by trimming (filtering) FCdb dataset by PCA, 
Uncertainty, Relief, Chi Squared, Gini Index, Deviation, Rule, 
Gain Ratio, Info Gain, and SVM weighting algorithms] were 
transformed to SVM format and scaled by grid search (to avoid 
attributes in greater numeric ranges dominating those in smaller 
numeric ranges) and to find the optimal values for operator 
parameters. 

To prevent overfitting problems, again 10-fold cross validation 
was applied and the averages of accuracies were computed. RBF 
kernel which nonlinearly maps samples into a higher dimensional 
space and can handle the case when the relation between class 
labels and attributes is nonlinear used to run with SVM models. 
Grid search is an appropriate way to determine the optimal values 
for two major parameters of the RBF (parameters C and gamma). 
Other kernels such as linear, poly, sigmoid and pre-computed 
were also applied to the datasets to find the best accuracy. 

Nsuve bayes. Naive Bayes based on Bayes conditional 
probability rule was used for performing classification and 
prediction tasks. Naive Bayes assumes the predictors are statistically 
independent which makes it an effective classification tool and easy 
to interpret. Two models. Naive base (returns classification model 
using estimated normal distributions) and Naive base kernel 
(returns classification model using estimated kernel densities) 
trained with 1 0-fold cross validation on all 1 1 databases (original 
dataset as weU as 10 dataset generated by applying 10 different 
attribute weighting algorithms) and the model accuracies in 
pr(-dicting the right HA virus class computed as stated before. 

Neural network. Two neural networks models [Neural Net and 
AutoMLP) trained with 1 0-fold cross validation on all 11 databases. 
The model accuracies in predicting the right HA virus class were 
computed as stated before. In other words, we used 10-fold cross 
validation of training subsets to compare the prediction accuracy 
of "neural network with neural network criterion" versus "neural 
network with AutoMLP criterion" on original dataset (FCdb) as 
well as 10 datasets trimmed (filtered) \'ia running weighting 
algorithm by PCA, Uncertainty, Relief, Chi Squared, Gini Index, 
Deviation, Rule, Gain Ratio, Info Gain, and SVM to find the 
optimal combination of neural network algorithm with dataset 
which allows the most accurate prediction (based on 10-fold cross 
validation). 

Neural Net learns a model by means of a feed-forward neural 
network trained by a ba[:k-propagation algorithm (multi-layer 
perceptron) and the structure of the neural network can be defined 
by parameter list "hidden_layers". Here a default hidden layer 
with sigmoid type and size (number of attributes+number of 
classes)/2+l created and added to the net. The used activation 
function is the usual sigmoid function. Therefore, the values 



ranged of the attributes scaled to —1 and -Hi. The type of the 
output node was sigmoid because the learning data described a 
classification task. 

AutoMLP algorithm combines ideas from genetic algorithms and 
stochastic optimization. It maintains a small ensemble of networks 
that are trained in parallel with different rates and different 
numbers of hidden units. After a small, fixed number of epochs, 
the error rate was determined on a validation set and the worst 
performers were replaced with copies of the best networks, 
modified to have different numbers of hidden units and learning 
rates. Hidden unit numbers and learning rates are drawn 
according to probability distributions derived from successful rates 
and sizes. 

Vi: Statistical Analyses of the Key Distinguishing Protein 
Attributes of HA Proteins (Selected by Attribute 
Weighting Algorithms) 

Various statistical methods such as descriptive statistics. 
Univariate Analysis of Variance (ANOVA) and Multivariate 
Analyses of Variance (MANOVA) were applied to investigated 
the behaviour and the effects of some important amino acid 
attributes (selected by attribute weighting algorithms) and some of 
the important dipeptides (selected by decision tree) models on HA 
subtypes (H1-H16). The features included: Non-reduced cysteines 
Ext, Count of Isoleucine, Freq of Cysteine, Freq of Aspartic Acid, 
Freq of Glutamic Acid (E), Freq of Glutamine, Freq of Arginine, 
Freq of Tyrosine, Percentage of Histidine, Percentage of 
Methionin, Percentage of Tryptophan, Percentage of Tyrosine, 
Count of Phe-Met, Count of Asn-Met, Freq of Pro-Gly, and Freq 
of Trp-Leu. 

Also, the agglomerative hierarchical cluster analysis was applied 
on dataset of the above mentioned key features consisting of 16 
protein attributes per each sequence. These given variables 
obtained by the feature selection criteria were standardized in 
order to be equally important in computing distance [99]. The 
method used in this cluster analysis was single linkage. Manhattan 
distance was apphed to compute the distance among items in this 
study. 

Supporting Information 

Figure SI Clustering of proteins based on the 16 
important features found in this study. The key discovered 
features (obtained by attribute weighting and decision tree models) 
are highly accurate in prediction and separation of HA subty-pes 
and reinforces the high capability of discovered features as 
predictors of HA subtyping. The features include: Non-reduced 
cysteines Ext, Count of Isoleucine, Freq of Cysteine, Freq of 
Aspartic Acid, Freq of Glutamic Acid, Freq of Glutamine, Freq of 
Arginine, Freq of Tyrosine, Percentage of Histidine, Percentage of 
Methionin, Percentage of Tryptophan, Percentage of Tyrosine, 
Count of Phe-Met, Count of Asn-Met, Freq of Pro-Gly, and Freq 
of Trp-Leu. 
(TIF) 

Table SI The original cleaned protein feature data set 
for HA sequences (Fcdb). This database contains 7338 protein 
sequences and 893 protein features. Table SI is shared by 
"Googledrive" at the following link: https://drive.google.com/ 
ffle/d/0B2Npj-saFbgeNjhwRTJubFRJdFk/edit?usp = sharing. 
(DOCX) 

Table S2 Distribution of important found protein 
attributes (according to attribute weighting algorithms) 
in the structure of HA subtypes. The features include: 
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frequency of Gin, The frequency of Tyr, Percentage of Tyr, 
Percentage of Cys, The frequency of Glu, Percentage of Try, 
Count of He, The frerjuenty of Arg, Percentage of His, The 
frequency of Asp, Percentage of Met, Non-reduced Cys extinction 
coeflicient at 280nm, and The frequency of Phe. 
PCLSX) 

Table S3 Univariate Analysis of Variance (ANOVA) and 
Multivariate Analyses of Variance (MANOVA) of the 
effects of important amino acid attributes (selected by 
attribute weighting algorithms) and some of the impor- 
tant dipeptides (selected by decision tree) models on 
determining HA subtypes (H1-H16). 
(DOCX) 

Table S4 Weighting algorithms used for selecting the 
most important protein features distinguishing different 
influenza HA subtypes [values (weights) closer to 1 
shows higher effectiveness of feature]. Weighting algo- 
rithms were weighting by PCA, weighting by Relief, weighting by 
Uncertainty, weigiiting b}' Gini index, weighting by Chi Squared, 
weighting by Deviation, weighting by Rule, weighting by 
Correlation, weighting by Gain Ratio, and weighting by 
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